Pronunciation Lexicon Development for Under-Resourced Languages Using Automatically Derived Subword Units: A Case Study on Scottish Gaelic

Authors

  • Marzieh Razavi
  • Ramya Rasipuram
  • Mathew Magimai Doss
Abstract

Developing a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not be available, particularly for under-resourced languages. To avoid the need for linguistic knowledge, acoustic information can be used to automatically obtain the subword units and the associated pronunciations. To that end, the present paper investigates the potential of a recently proposed hidden Markov model formalism for automatic derivation of subword units and lexicon development on a truly under-resourced and endangered language, namely Scottish Gaelic. Our studies show that the formalism is not only useful for developing a lexicon that helps in building better automatic speech recognition systems, but can also be extended to find the relationship between the derived subword units and the existing knowledge about phonetic units from resource-rich languages, specifically multilingual phones. Thus, the formalism paves a path for systematically combining acoustic and linguistic knowledge from multiple languages with the limited acoustic and linguistic knowledge of the under-resourced language in order to develop phone-like lexical resources based on automatic subword units.

Similar resources

Towards weakly supervised acoustic subword unit discovery and lexicon development using hidden Markov models

State-of-the-art automatic speech recognition and text-to-speech systems are based on subword units, typically phonemes. This necessitates a lexicon that maps each word to a sequence of subword units. Development of a phonetic lexicon for a language requires linguistic knowledge as well as human effort, which may not always be readily available, particularly for under-resourced languages. In su...

Improving Under-Resourced Language ASR Through Latent Subword Unit Space Discovery

Development of state-of-the-art automatic speech recognition (ASR) systems requires acoustic resources (i.e., transcribed speech) as well as lexical resources (i.e., phonetic lexicons). It has been shown that acoustic and lexical resource constraints can be overcome by first training an acoustic model that captures acoustic-to-multilingual phone relationships on language-independent data; and th...

Stochastic pronunciation modeling by ergodic-HMM of acoustic sub-word units

We propose a stochastic pronunciation model using an ergodic hidden Markov model (EHMM) of automatically derived acoustic sub-word units (SWU). The proposed EHMM discovers the pronunciation structure inherent in the acoustic training data of a word without any a priori phonetic transcriptions. The EHMM is an HMM of HMMs – its states are SWU HMMs and the state-transitions compose various possible...

Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling

Automatic speech recognition (ASR) systems incorporate expert linguistic knowledge through the use of a phone pronunciation lexicon (or dictionary) in which each word is associated with a sequence of phones. The creation of a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise as well as time and money. In this thesis, ...

Data-driven pronunciation modeling for ASR using acoustic subword units

We describe a method to model pronunciation variation for ASR in a data-driven way, namely by using automatically derived acoustic subword units. The inventory of units is designed so as to produce maximally separable pronunciation variants of words, while only the most important variants for the particular application are trained. In doing so, the optimal number of variants per ...

Publication date: 2015